Abstract: This paper investigates universal polar coding
schemes. In particular, a notion of ordering (called convolutional
path) is introduced between probability distributions to
determine when a polar compression (or communication) scheme
designed for one distribution can also succeed for another one.
The original polar decoding algorithm is also generalized to
an algorithm that learns information about the source
distribution using the idea of checkers. These tools are used to
construct a universal compression algorithm for binary sources,
operating at the lowest achievable rate (entropy), with low
complexity and with guaranteed small error probability.

In the second part of the paper, the problem of sketching
sparse high-dimensional discrete signals is approached
via the polarization technique. It is shown that the number of
measurements required for perfect recovery is competitive with
the $O(k \log(n/k))$ bound (with optimal constant for binary
signals), while affording a deterministic low-complexity
measurement matrix.

I. Introduction

A new technique called 'polarization' has recently been
introduced by Arikan in [3] to construct efficient channel coding
schemes. The codes resulting from this technique, called
polar codes, have several nice attributes: (1) they are linear
codes generated by a low-complexity deterministic matrix; (2)
they can be analyzed mathematically, and bounds on the error
probability (exponential in the square root of the block length)
can be proved; (3) they have a low encoding and decoding
complexity; (4) they achieve the Shannon capacity on
any discrete memoryless channel (DMC). These codes are
indeed the first codes with low decoding complexity that are
provably capacity achieving on any DMC.

Since [3], the polarization technique has been generalized
to various settings. For example, it has been used in [18] for
rate-distortion via duality with test channels, in [19], [16], [20]
for wiretap channels and information secrecy, and in [24], [2]
for a multi-user problem (multiple access).

In this paper, we investigate the problem of robustness 
of the polar coding schemes with respect to the knowledge 
of source or channel distribution. The perfect knowledge of 
this distribution is never available, and it is important that 
any potentially practical scheme shows some robustness to 
this knowledge. We hence develop several tools to construct 
universal polarization schemes. 

We then consider the problem of sketching high-dimensional
sparse signals using the polarization technique.
The hope is to leverage properties (1)-(4) to construct a
deterministic low-complexity sketching matrix and an efficient
sparse recovery algorithm. Since the method is defined for
signals valued in finite sets, it is of interest to lift the
construction to the real setting. Yet in this paper, we focus
our attention on the sketching problem for signals that are
discrete, motivated by applications dealing with such signals.
This occurs for example in network monitoring problems [12],
[13]. We will see that, just like one can exploit sparsity in the
domain, the sparsity in the magnitude (signals taking values in
finite sets) can be exploited to develop an efficient sketching
method via the polarization technique.

Some results in this paper have been presented in [1].

A. Channel and source polarization 

Arikan shows in [3] that an arbitrary binary input discrete
memoryless channel W can be polarized as follows: n independent
uses of W can be transformed into n successive
uses of synthesized binary input channels that have (except
for a vanishing fraction) a symmetric capacity which tends to
either 0 or 1 (with n). In [23], this result is generalized to q-ary
input alphabets where q is prime, and in [2] it is extended
to q being a power of two (considering q to be a power of
two has computational advantages, but the case of powers of
primes follows too). We state here the result of [23] for q prime.
Notation: $X^n := (X_1, \dots, X_n)$.

Theorem 1. Let W be a q-ary input discrete memoryless
channel with q prime, n a power of 2, and let $U^n$ be i.i.d.
uniform random variables on $\mathbb{F}_q$. Let $X^n = U^n G_n$, where
$G_n = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes \log_2(n)}$,
and let $Y^n$ be the output of n independent
uses of W when the input is $X^n$. Then, for any $\delta \in (0, 1)$,

$$\frac{1}{n} \left| \{ i \in [n] : I(U_i; Y^n U^{i-1}) > \delta \} \right| \xrightarrow{n \to \infty} I(W), \qquad (1)$$

where I(W) is the mutual information of W (with a uniform
input distribution).

Theorem 1 can then be used to show the following polarization
phenomenon for sources.

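Theorem 2. Let $X^n$ be n i.i.d. random variables with distribution
p on $\mathbb{F}_q$, n a power of 2, and let $U^n = X^n G_n$, where
$G_n = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes \log_2(n)}$.
Then, for any $\delta \in (0, 1)$,

$$\frac{1}{n} \left| \{ i \in [n] : H(U_i | U^{i-1}) > \delta \} \right| \xrightarrow{n \to \infty} H(p), \qquad (2)$$

where H(p) is the entropy of the distribution p.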



We will see in Section III that the previous result follows from
Theorem 1 via a duality argument. A slightly more general
result is presented in [5].



Note that all entropies and mutual informations are com- 
puted with logarithms in base q (where q is the input or source 
alphabet size). 

A coding scheme from Theorem 1. The limit in the theorem
implies that for n large enough and except for a vanishing
fraction of indices i, $I(U_i; Y^n U^{i-1})$ must be close to either
0 or 1. Hence, this suggests a coding scheme: on the indices
i for which the channel is good, i.e., $I(U_i; Y^n U^{i-1})$ is close
to 1, put uncoded information bits in $U_i$, and for the other
indices, put frozen bits that are predetermined and revealed
to the decoder. This defines the vector $U^n$. Then, the vector
$X^n$ is sent over n independent uses of the channel. Note that
the rate of this code is given by the logarithm of the number
of information bits in $U^n$ divided by n, and by Theorem 1
this can be made arbitrarily close to I(W). Now the receiver
knows two things: the location$^1$ of the indices i containing
information and frozen bits, and the value of the frozen bits
(on symmetric channels, the frozen bits can all be chosen to be
zero). Hence, from the channel output of $X^n$, the receiver starts by
decoding the first component $U_i$ which is not frozen. By virtue
of Theorem 1, one of the two possible values of $U_i$ will have
(w.h.p.) a probability close to one, and hence the decoder has
a small probability of decoding $U_i$ incorrectly. This process is
then iterated to decode successively the entire vector $U^n$. An
analysis of the scaling of the bit error probability (wrongly
decoding a component $U_i$) allows one to conclude that w.h.p.
errors cannot propagate in this scheme, and hence this scheme
achieves the uniform mutual information of the channel. A
remarkable feature of this coding scheme is that the encoding
and decoding complexity is shown to be $O(n \log n)$.

A coding scheme from Theorem 2. The limit in the theorem
implies that for n large enough and except for a vanishing fraction
of indices i, $H(U_i | U^{i-1})$ is close to either 0 or 1. Hence,
the transformation $G_n$ extracts the randomness in $X^n$, which
is initially uniformly dissipated over the n components, into
specific components indexed by the i's such that $H(U_i | U^{i-1})$
is close to 1. Lossless compression can then be performed as
follows: from a given source output $X^n$, compute $U^n$ and store
the components of $U^n$ which do not have an entropy close to
0. Note that, from Theorem 2, the compression rate can be
made arbitrarily close to H(p) (lowest possible rate). For the
reconstruction, since the components with entropy close to 0
can be recovered correctly with high probability given the
past components, we can proceed successively in a manner
analogous to the channel decoding problem. The speed of
polarization is shown to scale similarly as in the channel
case, and again, the encoding and decoding complexity of this
source coding scheme is only $O(n \log n)$.
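The source polarization phenomenon described above can be observed numerically for small block lengths. The following is a minimal brute-force sketch for binary sources only, not the recursive $O(n \log n)$ procedure of the cited works; the function names `polar_matrix` and `conditional_entropies` are hypothetical, and the enumeration over all $2^n$ source words is only practical for very small n.

```python
import itertools
import math

import numpy as np


def polar_matrix(n):
    """G_n = [[1, 0], [1, 1]] Kronecker-powered log2(n) times (n a power of 2)."""
    g2 = np.array([[1, 0], [1, 1]], dtype=int)
    g = np.array([[1]], dtype=int)
    while g.shape[0] < n:
        g = np.kron(g, g2)
    return g


def conditional_entropies(p_one, n):
    """Return [H(U_i | U^{i-1})] in bits for X^n i.i.d. Bernoulli(p_one), U^n = X^n G_n."""
    g = polar_matrix(n)
    # prefix_law[i] maps a prefix (u_1, ..., u_i) to its probability.
    prefix_law = [dict() for _ in range(n + 1)]
    for x in itertools.product((0, 1), repeat=n):
        weight = sum(x)
        prob = (p_one ** weight) * ((1 - p_one) ** (n - weight))
        u = tuple(int(v) for v in (np.array(x) @ g) % 2)
        for i in range(n + 1):
            prefix_law[i][u[:i]] = prefix_law[i].get(u[:i], 0.0) + prob

    def entropy(law):
        return -sum(v * math.log2(v) for v in law.values() if v > 0)

    # Chain rule: H(U_i | U^{i-1}) = H(U^i) - H(U^{i-1}).
    joint = [entropy(law) for law in prefix_law]
    return [joint[i + 1] - joint[i] for i in range(n)]


if __name__ == "__main__":
    for i, h in enumerate(conditional_entropies(0.11, 8), start=1):
        print(f"H(U_{i} | U^{i - 1}) = {h:.4f}")
```

Since $G_n$ is invertible, the conditional entropies always sum to n times the source entropy; polarization redistributes this total so that, as n grows, most terms move toward 0 or 1.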

B. Goals 

In this paper, we are interested in analyzing how sensitive
the performance of the previous source/channel coding scheme
is to the knowledge of the source/channel distribution. This
knowledge is used at two moments
for each problem. In the channel coding problem, it is
first used to identify the location of the "good channels", or
equivalently, the location of the indices i where the information
bits shall be sent. It is then used again in the decoding
process, to compute the probabilities that an information
bit $U_i$ is equal to each element of $\mathbb{F}_q$ (from the polarization
phenomenon, we know that one of these probabilities should
be close to 1, but one still needs to compute
which one it is). Similarly, for source coding, the knowledge
of the source distribution is first used to find the components
of $U^n$ which must be stored, and then in the reconstruction
part, to compute the value of each non-stored component.

$^1$ No analytical formula is known to compute these indices. They are found
with algorithms, as in [25].

$^2$ To show achievability, the speed of convergence to the polarized channels
matters; it is analyzed in [4].

We will hence address the problem of constructing polar
coding schemes which can losslessly compress sources without
requiring perfect knowledge of their distributions, or which
can communicate reliably over channels without requiring
perfect knowledge of the channel distribution. The application
to the channel setting then follows from Section III, where
the duality between the source and channel problems is made
explicit.

We will then consider the problem of sparse data recovery
using polar codes. From the discussion on the source polarization
theorem above, a connection to the sketching problem
is apparent: if we sense the signal $U^n$ only in the components
i for which $H(U_i | U^{i-1})$ is close to 1, we obtain a sampling
of the signal which allows perfect recovery of the full signal,
with a significantly reduced number of measurements. There
are however several differences between a compressed sensing 
setting Q, lim and the source polarization setting; in partic- 
ular, in the latter setting the source is random with a known 
distribution and it is valued in a finite field (of arbitrarily 
large cardinality), whereas it is real and with no prior (besides 
sparse) in compressed sensing. Hence, a first question is to 
ask how sparsity, i.e., the property of having many components 
that are 0, is modeled for such random signals, and how much 
the choice of a specific sparse probability distribution matters. 
This part can be investigated using our results on universal 
source polarization, which establishes the connection between 
the two parts of this paper. 

II. Results 

A universal compression algorithm for binary sources is
introduced in Section V-A. Theorem 3 shows that this algorithm
performs at the lowest achievable rate (entropy), with
$O(n \log_2 n)$ complexity and (roughly) $O(2^{-\sqrt{n}})$ error
probability.

Partial generalizations are discussed for non-binary sources
in Section V-B.

In Section VI-B, a low-complexity deterministic sketching
matrix is constructed. It is shown in Theorem 4 that for
k-sparse signals in $\mathbb{F}_a^n$, $O(k \log(n/k))$ measurements taken
with the proposed sketching matrix are sufficient to recover
the original vector perfectly with probability at least
$1 - O(2^{-\sqrt{n}})$, with a reconstruction algorithm of complexity
$O(a \log_2 a \cdot n \log_2 n)$. An improved version of this theorem
(regarding the dependence on a of the constants) is investigated
in Section VI-E.

III. Duality between source and channel polarization

In this section, we connect Theorem 1 and Theorem 2. Let
p be a distribution on $\mathbb{F}_q$ and consider using Theorem 1 for an
additive noise channel, i.e., $Y = X \oplus Z$ for Z distributed under
p and independent of X. We then have $Y^n = G_n U^n \oplus Z^n$
and

$$\begin{aligned}
I(U_i; Y^n U^{i-1}) &= 1 - H(U_i | Y^n U^{i-1}) \\
&= 1 - H\big((G_n Y^n \ominus G_n Z^n)_i \,\big|\, Y^n (G_n Y^n \ominus G_n Z^n)^{i-1}\big) \\
&= 1 - H\big((G_n Z^n)_i \,\big|\, (G_n Z^n)^{i-1}\big). \qquad (3)
\end{aligned}$$

Equality (3) uses the fact that $Y^n$ is independent of $Z^n$
because $U^n$, and hence $G_n U^n$, are uniformly distributed over
$\mathbb{F}_q^n$. We also use the fact that $G_n^{-1} = G_n$. Hence, Theorem 1
and (3) imply Theorem 2.

Stated as such, Theorem 2 does not imply Theorem 1,
since additive noise channels are not representative of all
possible channels. In [5], a slightly more general result than
Theorem 2 is stated, where an auxiliary random variable Y
(side-information), which is a random variable correlated with
X but not intended to be compressed, is introduced in the
conditioning of each entropy term. This could be used for the
reverse implication.

In this paper, we focus mostly on the source setting, since 
it is the "simplest" setting, hence the one to start with. Using 
previous expansions, the results obtained in the source setting 
will directly admit a counter-part in the channel setting, for 
the case of additive noise channels. 

IV. Defining orderings and mathematical preliminaries

Definition 1 (Measures). Let a be a prime integer, let $\mathbb{F}_a :=
\{0, 1, \dots, a - 1\}$, and let $M(a)$ be the set of probability measures
on $\mathbb{F}_a$. For any $k \in \mathbb{F}_a$, let

$$M_k(a) := \{ p \in M(a) : p(i) = p(j), \ \forall i, j \neq k, \ p(k) > 1/a \}$$

and $\bar{M}(a) := \bigcup_{k \in \mathbb{F}_a} M_k(a)$. We refer to the measures in $\bar{M}(a)$
as the spike measures.

Definition 2 (Matrices). We denote by Doub(a) the set of 
doubly stochastic matrices of size a x a, and by Circ(a) the 
set of circulant stochastic matrices of size a x a. 

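Definition 3 (Orders). We define

$$p_1 \prec_h p_2 \ \text{ if } \ H(p_1) \geq H(p_2), \qquad (4)$$
$$p_1 \prec_d p_2 \ \text{ if } \ p_1 = D p_2 \ \text{ for } D \in \mathrm{Doub}(a), \qquad (5)$$
$$p_1 \prec_c p_2 \ \text{ if } \ p_1 = C p_2 \ \text{ for } C \in \mathrm{Circ}(a). \qquad (6)$$

Note that $\prec_d$ is the majorization order and $p_1 \prec_c p_2$ is
equivalent to $p_1 = c \star p_2$ for $c \in M(a)$, where $\star$ denotes
the circular convolution on $\mathbb{F}_a$.

Note that we use the term "order" in a broad sense here
(not a mathematical order).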

Lemma 1 (Orders hierarchy).

$$p_1 \prec_c p_2 \ \Rightarrow\ p_1 \prec_d p_2 \ \Rightarrow\ p_1 \prec_h p_2. \qquad (7)$$

Proof: The first implication follows from the fact that
$\mathrm{Circ}(a) \subset \mathrm{Doub}(a)$ and the second implication follows from
the Schur-concavity of the entropy [21]. ■
One can easily find examples showing that the
reverse implications in Lemma 1 do not hold. In this paper, we are
interested in the $\prec_c$ order, and the previous lemma gives a first
idea of how this order compares to the majorization order.
But we will only work with $\prec_c$ in this paper. Also note that
the set of measures which are worse than a given $p \in M(a)$
with respect to $\prec_c$ is given by the convex hull of the orbit of
p through cycles, whereas it is given by the convex hull of the
orbit of p through permutations when considering $\prec_d$. Note
that if $p \in \bar{M}(a)$, these two sets are the same.
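The relation between the convolution order, circulant matrices, and entropy can be checked numerically on a small example. The sketch below is illustrative only: the helper names `circ_conv`, `circulant`, and `entropy` are hypothetical, and the two distributions are arbitrary choices, not taken from the paper.

```python
import numpy as np


def circ_conv(c, p):
    """Circular convolution (c * p)(k) = sum_j c(j) p(k - j mod a) on F_a."""
    a = len(p)
    return np.array([sum(c[j] * p[(k - j) % a] for j in range(a)) for k in range(a)])


def circulant(c):
    """Circulant stochastic matrix C with C[k, j] = c(k - j mod a)."""
    a = len(c)
    return np.array([[c[(k - j) % a] for j in range(a)] for k in range(a)])


def entropy(p):
    """Entropy with logarithms in base a = len(p), following the paper's convention."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(len(p)))


if __name__ == "__main__":
    p2 = [0.7, 0.2, 0.1]      # arbitrary example distribution on F_3
    c = [0.8, 0.15, 0.05]     # arbitrary convolution kernel in M(3)
    p1 = circ_conv(c, p2)     # p1 is dominated by p2 in the convolution order
    C = circulant(c)
    print("p1 =", p1)
    print("C p2 equals c * p2:", np.allclose(C @ p2, p1))
    print("C doubly stochastic:", np.allclose(C.sum(0), 1), np.allclose(C.sum(1), 1))
    print("H(p1) >= H(p2):", entropy(p1) >= entropy(p2))
```

The circulant matrix is doubly stochastic, so the entropy can only increase, which is the chain of implications in Lemma 1 on this example.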

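Definition 4. For $p \in M(a)$, we define the Fourier transform
of p by

$$\mathcal{F}(p)(\omega) = \sum_{k=0}^{a-1} p(k) e^{-2\pi i k \omega / a}, \quad \omega \in \mathbb{F}_a, \qquad (8)$$

and the inverse Fourier transform of $h : \mathbb{F}_a \to \mathbb{C}$ by

$$\mathcal{F}^{-1}(h)(k) = \frac{1}{a} \sum_{\omega=0}^{a-1} h(\omega) e^{2\pi i k \omega / a}, \quad k \in \mathbb{F}_a. \qquad (9)$$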

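Remark 1.

1. $\mathcal{F}(p \star q) = \mathcal{F}(p) \mathcal{F}(q)$ for any $p, q \in M(a)$.

2. If $p \in M_k(a)$ with $p(k) = 1 - P$, we have that $\mathcal{F}(p)$ is
given by $\mathcal{F}(p)(0) = 1$ and

$$\mathcal{F}(p)(\omega) = \left(1 - \frac{aP}{a - 1}\right) e^{-2\pi i k \omega / a}, \quad \omega \neq 0. \qquad (10)$$

3. From the previous remark, note that $(\bar{M}(a), \star)$ is a semi-group.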

Definition 5. For $p \in M(a)$, let $\mathrm{DOM}_c(p)$ be the set of
probability measures which dominate p with respect to $\prec_c$,
i.e., $\mathrm{DOM}_c(p) = \{q \in M(a) : p \prec_c q\}$.

Remark 2. Note that it is easier to describe the set of
measures that are dominated by a fixed measure p than the
reverse. However, we can write $\mathrm{DOM}_c(p) = \{q \in M(a) :
\mathcal{F}^{-1}(\mathcal{F}(p) / \mathcal{F}(q)) \geq 0\}$, and we can use the FFT algorithm
to compute $\mathrm{DOM}_c(p)$ efficiently.
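The membership test of Remark 2 can be sketched with NumPy's FFT, whose sign and normalization conventions match (8) and (9). This is a minimal illustration under assumptions: the function names `conv_factor` and `in_dom_c` are hypothetical, the numerical tolerances are arbitrary, and the case where $\mathcal{F}(q)$ has a zero is simply reported as undecided (`None`) rather than handled.

```python
import numpy as np


def conv_factor(p, q, eps=1e-12):
    """Return c with p = c * q (circular convolution), or None if F(q) has a zero.

    c is computed as F^{-1}(F(p) / F(q)); it is real for real p and q,
    and sums to one because F(p)(0) = F(q)(0) = 1.
    """
    fp = np.fft.fft(np.asarray(p, dtype=float))
    fq = np.fft.fft(np.asarray(q, dtype=float))
    if np.any(np.abs(fq) < eps):
        return None  # the ratio is undefined; this sketch does not treat that case
    return np.real(np.fft.ifft(fp / fq))


def in_dom_c(p, q, tol=1e-12):
    """True when q dominates p in the convolution order, i.e. c is a valid measure."""
    c = conv_factor(p, q)
    return c is not None and bool(np.all(c >= -tol))


if __name__ == "__main__":
    q = np.array([0.7, 0.2, 0.1])
    c0 = np.array([0.8, 0.15, 0.05])
    p = np.real(np.fft.ifft(np.fft.fft(c0) * np.fft.fft(q)))  # p = c0 * q
    print(in_dom_c(p, q), in_dom_c(q, p))
    print(conv_factor(p, q))
```

On this example the recovered factor is the kernel `c0`, while the reverse test fails because the inverse of a non-degenerate kernel has negative entries.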

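Lemma 2. For any a > 1,

$$p_1, p_2 \in \bar{M}(a), \ p_1 \prec_h p_2 \ \Rightarrow\ p_1 \prec_c p_2. \qquad (11)$$

Proof: Assume that $p_1 \in M_k(a)$ and $p_2 \in M_l(a)$ for
$k, l \in \mathbb{F}_a$, and $p_1 \prec_h p_2$. Then, denoting $1 - P_1 = p_1(k)$ and
$1 - P_2 = p_2(l)$,

$$\mathcal{F}(p_1)(\omega) / \mathcal{F}(p_2)(\omega) = \frac{1 - \frac{a P_1}{a - 1}}{1 - \frac{a P_2}{a - 1}} \, e^{-2\pi i (k \ominus l) \omega / a}. \qquad (12)$$

Hence, if $(1 - \frac{a P_1}{a-1}) / (1 - \frac{a P_2}{a-1}) \in \mathrm{Im}(f)$, where $f : P \in
[0, 1] \mapsto (1 - aP/(a - 1))$, we have that (12) is the Fourier
transform of an element in $\bar{M}(a)$. This is easily verified since
$\mathrm{Im}(f) \supseteq [0, 1]$ and since by assumption $1 - P_1 \leq 1 - P_2$. ■
Since $\bar{M}(2) = M(2)$, we have the following corollary.

Corollary 1.

$$p_1, p_2 \in M(2), \ p_1 \prec_h p_2 \ \Rightarrow\ p_1 \prec_c p_2. \qquad (13)$$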

We now introduce one more ordering notion. 

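Definition 6. We define for $p_1, p_2 \in M(a)$

$p_1 \prec_{cp} p_2$ iff $p_1 = p_2 \star \nu$, where $\nu$ is an infinitely divisible
probability distribution.

Definition 7. A probability distribution $p \in M(a)$ is infinitely
divisible if for any $k \geq 1$, there exists $p_k \in M(a)$ such that

$$p = p_k^{\star k} \quad \text{(the k-fold circular convolution of } p_k \text{ with itself)},$$

or equivalently, if $\mathcal{F}^{-1}(\mathcal{F}(p)^{1/k}) \geq 0$.

Note that checking the infinite divisibility condition for a
large enough k implies the result for smaller k's (by grouping
the $p_k$'s). Hence, denoting $\varepsilon = 1/k$, we need to check that
$\mathcal{F}(p)^{\varepsilon}$ has a valid inverse Fourier transform when $\varepsilon$ tends to 0.
Let $z = \mathcal{F}(p)$ and denote the components of z by $z_j = r_j e^{i \theta_j}$.
Then

$$\begin{aligned}
z_j^{\varepsilon} = r_j^{\varepsilon} e^{i \varepsilon \theta_j} &= (1 + \varepsilon \log r_j)(1 + i \varepsilon \theta_j) + o(\varepsilon) \\
&= 1 + \varepsilon (\log r_j + i \theta_j) + o(\varepsilon).
\end{aligned}$$

Hence, by the linearity of $\mathcal{F}^{-1}$,

$$\mathcal{F}^{-1}(z^{\varepsilon}) = (1, 0, \dots, 0) + \varepsilon \mathcal{F}^{-1}\big((\log r_j + i \theta_j)_j\big) + o(\varepsilon),$$

and to ensure $\mathcal{F}^{-1}(z^{\varepsilon}) \geq 0$ for any $\varepsilon > 0$, we need to ensure
that

$$y(1), \dots, y(a - 1) \geq 0, \qquad (14)$$

where $y = \mathcal{F}^{-1}\big((\log r_j + i \theta_j)_j\big)$.

Note that the dependency on k has been removed in the previous
condition.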
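Condition (14) lends itself to a direct numerical check. The sketch below is a hedged illustration: it assumes the principal branch of the complex logarithm and of the fractional power (the text does not specify a branch), it additionally requires y to be real, and the names `conv_root`, `path_generator`, and `satisfies_condition_14` are hypothetical.

```python
import numpy as np


def conv_root(p, k):
    """Candidate k-th convolution root F^{-1}(F(p)^{1/k}), principal branch (assumption)."""
    z = np.fft.fft(np.asarray(p, dtype=float))
    return np.fft.ifft(z ** (1.0 / k))


def path_generator(p, eps=1e-12):
    """y = F^{-1}((log r_j + i theta_j)_j), using the principal complex log (assumption)."""
    z = np.fft.fft(np.asarray(p, dtype=float))
    if np.any(np.abs(z) < eps):
        raise ValueError("F(p) has a zero; the logarithm is undefined")
    return np.fft.ifft(np.log(z))


def satisfies_condition_14(p, tol=1e-9):
    """Check that y is real and that y(1), ..., y(a-1) are nonnegative."""
    y = path_generator(p)
    return bool(np.all(np.abs(y.imag) < tol) and np.all(y.real[1:] >= -tol))


if __name__ == "__main__":
    print(satisfies_condition_14([0.9, 0.1]))   # mass concentrated on 0
    print(satisfies_condition_14([0.1, 0.9]))   # mass concentrated on 1
    print(conv_root([0.9, 0.1], 4).real)        # a 4th convolution root of [0.9, 0.1]
```

For binary measures the check separates distributions putting more mass on 0 from those putting more mass on 1, which is consistent with the role the two regions $D_0(R)$ and $D_1(R)$ play later in the paper.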

To summarize: we have defined a notion of 'convolution
ordering', with $\prec_c$, where one can reach a distribution from
another one with a circular convolution, and a notion of
'convolutional path ordering', with $\prec_{cp}$, where one can reach
the second distribution with small convolutional steps.

V. Universality in polarization 

As mentioned in the introduction, there are two parts which 
require the knowledge of the source distribution in the source 
polar coding scheme: one in the compression and one in the 
reconstruction part. We present in this section two lemmas to 
be used for each of these parts in universal results. We start 
with the compression part. 

Definition 8 (Polar storage sets). Let $\delta \in (0, 1)$, n a power
of 2 and $p \in M(a)$,

$$S_{\delta,n}(p) := \{ i \in [n] : H(U_i | U^{i-1}) > \delta \},$$

where $U^n = X^n G_n$, $G_n = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes \log_2(n)}$, and $X^n$ is i.i.d. under p. We use
the notation

$$S(p_1) \supseteq S(p_2) \ \text{ if } \ S_{\delta,n}(p_1) \supseteq S_{\delta,n}(p_2) \ \ \forall \delta \in (0, 1), n.$$

(We will sometimes call the components of $U^n$ on S
the information bits.) The reason why we are interested in
nested storage sets is clear: if one stores the information components of a
source distributed under $p_1$, one also stores the information
components of any source $p_2$ with $S(p_1) \supseteq S(p_2)$ (this
consumes more rate than required for compressing a source
under $p_2$ specifically, but it allows lossless compression for
both). However, for the reconstruction, it is not clear whether
the nested structure is sufficient to induce a universal decoding
process. But let us postpone for now the reconstruction
problem and focus on the nested structure only.

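Lemma 3. For any a > 1,

$$p_1 \prec_c p_2 \ \Rightarrow\ S(p_2) \subseteq S(p_1). \qquad (15)$$

Proof: By assumption, there exists $c \in M(a)$ such that
$p_1 = p_2 \star c$. Let $X^n$ be i.i.d. under $p_2$, let $Z^n$ be i.i.d. under c and independent of $X^n$, and let
$\tilde{X}^n = X^n \oplus Z^n$, which is i.i.d. under $p_1$. Define $U^n = G_n X^n$, $\tilde{U}^n = G_n \tilde{X}^n$
and $W^n = G_n Z^n$, hence $\tilde{U}^n = U^n \oplus W^n$. We have

$$\begin{aligned}
H(\tilde{U}_i | \tilde{U}^{i-1}) &\geq H(\tilde{U}_i | \tilde{U}^{i-1}, W^n) \qquad (16) \\
&= H(U_i | U^{i-1}, W^n) \qquad (17) \\
&= H(U_i | U^{i-1}) \qquad (18)
\end{aligned}$$

where the last equality follows from the fact that $U^n$ is
independent of $W^n$ since $X^n$ is independent of $Z^n$. ■
Note that this entropy ordering indeed holds for all indices.
We now investigate the reconstruction problem. We first recall
the decoding algorithm used in [3].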

Definition 9. [polar-dec algorithm [3], [5]]
Inputs: $p \in M(a)$, $n \in \mathbb{Z}_+$, $S \subseteq [n]$ and $u[S] \in \mathbb{F}_a^{|S|}$.
Output: polar-dec(p, $u[S]$, n) $\in \mathbb{F}_a^n$.
The algorithm proceeds as follows:

(0) Initialize $M = S$;

(1) Find the smallest integer i in $M^c$ and compute
$u_i = \arg\max_{x \in \mathbb{F}_a} P_p\{U_i = x | u[M]\}$;

(2) Update $M = M \cup \{i\}$ and go back to (1) until $M = [n]$;

(3) Output $x^n = u^n G_n$ where $u^n = u[M]$.

The term $P_p\{U_i = x | u[M]\}$ is the probability that $U_i = x$
when $U[M] = u[M]$ is observed, where $U^n = X^n G_n$ and $X^n$ is i.i.d. under p.
It is shown in [3], [5] that the computational cost for each
of these probabilities, as well as for the overall algorithm, is
bounded as $O(n \log_2 n)$ (more precisely $O(a^2 n \log_2 n)$ for the
dependence on a, and $O(a \log_2(a) \cdot n \log_2 n)$ if [2] is used
and a is a power of 2). We refer to [3], [5] for the recursive
procedure to compute these probabilities, which uses a "divide
and conquer" procedure based on the Kronecker structure of $G_n$.
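Definition 10. Let $p_1, p_2 \in M(a)$, $\delta \in (0, 1)$, $n \geq 1$, $X^n$ i.i.d. under
$p_2$, and $\hat{X}^n$ = polar-dec($p_1$, $U[S_{\delta,n}(p_1)]$, n), where $U^n =
X^n G_n$. We define

$$P_e(p_1 | p_2) := P\{\hat{X}^n \neq X^n\}.$$

Lemma 4. For any a > 1, $\delta \in (0, 1/2)$, $n \geq 1$,

$$p_1 \prec_{cp} p_2 \ \Rightarrow\ P_e(p_1 | p_2) \leq P_e(p_2 | p_2).$$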

Proof: Fix n and $\delta < 1/2$. If $p_1$ is the uniform distribution,
$S_{\delta,n}(p_1) = \{1, \dots, n\}$ and the claim is clear: since we
store all components, the left-hand side error probability is 0.
Hence, assume that $p_1$ is not the uniform distribution.

Let us assume that a = 2; the proof for a > 2 is similar.
Let $p_2 \in M(2)$ and $q = p_2 \star 1_\rho$, where $1_\rho = [1 - \rho, \rho]$. Since
$q \prec_c p_2$, we have $S_{\delta,n}(q) \supseteq S_{\delta,n}(p_2)$ and for the components
i to be decoded

$$\delta \geq H(U_i | U^{i-1}) \geq H(V_i | V^{i-1}),$$

where the $U_i$'s (resp. $V_i$'s) are obtained by polarizing a source i.i.d. under q (resp. $p_2$). For
$W_1, \dots, W_n$ obtained in the same way from a source i.i.d. under p and $w, w_1, \dots, w_{i-1} \in \{0, 1\}$,
define the mapping

$$f_{w | w^{i-1}} : p \mapsto P(W_i = w | W^{i-1} = w^{i-1}). \qquad (19)$$

Note that $f_{w | w^{i-1}}$ is continuous over M(2) (with the topology
induced by $\mathbb{R}^2$) for any $w, w_1, \dots, w_{i-1} \in \{0, 1\}$.

Also note that $H(U_i | U^{i-1}) \leq \delta$ implies that there exists $\xi(\delta)$
with $\xi(\delta) \to 0$ as $\delta \to 0$, such that for any $u^{i-1} \in \{0, 1\}^{i-1}$,

$$P(U_i = 0 | U^{i-1} = u^{i-1}) \wedge P(U_i = 1 | U^{i-1} = u^{i-1}) \leq \xi(\delta). \qquad (20)$$

Hence, using (20) and the continuity of $f_{w | w^{i-1}}$, we have for
$\rho$ small enough and any i in the complement of $S_{\delta,n}(q)$,

$$\arg\max_{u \in \{0, 1\}} P(U_i = u | U^{i-1} = u^{i-1}) \qquad (21)$$
$$= \arg\max_{u \in \{0, 1\}} P(V_i = u | V^{i-1} = u^{i-1}). \qquad (22)$$

For $p_1 \prec_{cp} p_2$, we have that $p_1 = p_2 \star 1_\rho^{\star k}$ for any k,
where $\rho$ depends on k and is decreasing with k increasing.
We now want to iterate the previous argument, but we have to
use the continuity of (19) at different distributions q, namely
$q = p_2 \star 1_\rho^{\star l}$ for $l = 1, \dots, k$. Since this is a compact set
of q's (the entire path from $p_2$ to $p_1$), we can pick k large
enough such that the continuity argument remains effective
along the entire path, and (22) is proved by (20). It is important
to assume that $p_1$ is not uniform, so as to keep p bounded
below from 0.

Finally, from (22), we have that the algorithm polar-dec
used with the mismatched distribution still leads to the same
output as when used with the matched one. Since in the mismatched
scenario we observe more components than needed,
strictly speaking we have an inequality in the error probability
as in the lemma's statement.

For $a \geq 3$, the proof is identical, except that we are moving
along $p_2 \star \nu$ where $\nu$ is close to a delta function over $\mathbb{F}_a$,
and (20), resp. (22), holds when the minimum, resp. maximum,
includes all elements of $\mathbb{F}_a$. ■



This result tells us that, if we do the compression and the
reconstruction using the distribution $p_1$, we can compress and
reconstruct losslessly any source distribution which is better
than $p_1$ in terms of $\prec_{cp}$.

If one were to use a compression scheme ignoring any
complexity considerations, then, simply by knowing that the
source distribution has an entropy at most R, it would be
possible to compress and reconstruct the source losslessly,
using the method of types for example, at rate R. The sets
$\{p \in M(a) : H(p) \leq R\}$ are essentially the largest sets which
can be compressed losslessly at a fixed rate. It is ambitious
to ask for such a "broad universality" with polar codes,
since these are structured codes with complexity attributes,
in contrast to the codes derived with the method of types. We
may have to give up some extra rate to achieve this goal, or
we may universally compress only certain subsets of source
distributions. We now investigate these points.

A. Results for binary sources 

For binary sources, it is possible to achieve a broad universal
result with polar coding.

Notation: For a given $0 < R < 1$, let $p_0(R), p_1(R)$ be the
two binary probability distributions such that $H(p_0(R)) =
H(p_1(R)) = R$.
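The two distributions $p_0(R)$ and $p_1(R)$ can be computed numerically by inverting the binary entropy function. The following is a small sketch using bisection; the function names are hypothetical, and the indexing (with $p_0(R)$ putting more mass on the symbol 0) follows the convention used later in the proof of Theorem 3.

```python
import math


def binary_entropy(eps):
    """Binary entropy in bits of the distribution [1 - eps, eps]."""
    if eps in (0.0, 1.0):
        return 0.0
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)


def entropy_level_pair(rate, tol=1e-12):
    """Return (p0, p1), the two binary distributions with entropy equal to rate.

    Bisection on [0, 1/2], where the binary entropy is increasing.
    p0 puts more mass on symbol 0 and p1 puts more mass on symbol 1.
    """
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) < rate:
            lo = mid
        else:
            hi = mid
    eps = (lo + hi) / 2
    return (1 - eps, eps), (eps, 1 - eps)


if __name__ == "__main__":
    p0, p1 = entropy_level_pair(0.5)
    print(p0, p1)
```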

Definition 11 (Universal polar compression algorithm).

A. Compression:

Inputs: $R \in [0, 1]$ (the rate of compression), $\delta$ (the target error
probability), $x \in \mathbb{F}_2^n$ (the data).

Output: V e F'2^^°^"^ (the stored data). 

The compression algorithm proceeds as follows:

1. Compute $u = G_n x$

2. Store $u[S_{\delta,n}(p_0(R))]$ and $u[n]$

B. Reconstruction:

Inputs: n, R, $u[S_{\delta,n}(p_0(R))]$ and $u[n]$

Output: polar-dec-adapt($p_0(R)$, $p_1(R)$, $u[S_{\delta,n}(p_0(R))]$,
$u[n]$, n)

Definition 12 (polar-dec-adapt algorithm).

Inputs: $p_1, \dots, p_k \in M(a)$, $n \in \mathbb{Z}_+$, $S \subseteq [n]$, $T \subseteq S^c$ (called
the set of checkers) and $u[T \cup S] \in \mathbb{F}_a^{|T| + |S|}$.

Output: polar-dec-adapt($p_1, \dots, p_k$, $u[S]$, $u[T]$, n) $\in \mathbb{F}_a^n$.

The algorithm proceeds as follows:

(1) For $j = 1, \dots, k$, run $u^n_{(j)}$ = polar-dec($p_j$, $u[S]$, n)

(2) Find $t = \arg\min_{j = 1, \dots, k} d_H(u_{(j)}[T], u[T])$ (pick one at
random for ties)

(3) Output polar-dec($p_t$, $u[S]$, n). (Variant: output
polar-dec($p_t$, $u[S \cup T]$, n))
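The checker mechanism of polar-dec-adapt can be sketched for binary sources as follows. This is a toy illustration under assumptions: each candidate model is a Bernoulli parameter, the inner decoder is a brute-force enumeration rather than the recursive procedure, indices are 0-based, and ties between candidates are broken by taking the first one instead of a random one. The names `decode_u` and `polar_dec_adapt` are hypothetical.

```python
import itertools

import numpy as np


def polar_matrix(n):
    """G_n = [[1, 0], [1, 1]] Kronecker-powered log2(n) times (n a power of 2)."""
    g2 = np.array([[1, 0], [1, 1]], dtype=int)
    g = np.array([[1]], dtype=int)
    while g.shape[0] < n:
        g = np.kron(g, g2)
    return g


def decode_u(p_one, stored, n):
    """Brute-force successive estimate of u^n from u[S] under a Bernoulli(p_one) model."""
    g = polar_matrix(n)
    law = {}
    for x in itertools.product((0, 1), repeat=n):
        weight = sum(x)
        u = tuple(int(v) for v in (np.array(x) @ g) % 2)
        law[u] = (p_one ** weight) * ((1 - p_one) ** (n - weight))
    known = dict(stored)
    for i in range(n):
        if i in known:
            continue
        mass = [0.0, 0.0]
        for u, prob in law.items():
            if all(u[j] == b for j, b in known.items()):
                mass[u[i]] += prob
        known[i] = 0 if mass[0] >= mass[1] else 1
    return tuple(known[i] for i in range(n))


def polar_dec_adapt(p_ones, stored, checkers, n):
    """Steps (1)-(3): decode with each model, keep the one closest on the checkers."""
    best_u, best_dist = None, None
    for p_one in p_ones:                                   # step (1)
        u_hat = decode_u(p_one, stored, n)
        dist = sum(u_hat[i] != b for i, b in checkers.items())  # Hamming distance on T
        if best_dist is None or dist < best_dist:          # step (2), first wins on ties
            best_u, best_dist = u_hat, dist
    g = polar_matrix(n)
    return tuple(int(v) for v in (np.array(best_u) @ g) % 2)    # step (3)


if __name__ == "__main__":
    x = (1, 0, 1, 1)                       # a word with most of its mass on symbol 1
    u = (np.array(x) @ polar_matrix(4)) % 2
    stored = {i: int(u[i]) for i in (0, 1, 2)}
    checkers = {3: int(u[3])}              # the last component acts as the checker
    print(polar_dec_adapt([0.1, 0.9], stored, checkers, 4))
```

In this toy run the two models produce estimates that differ only in the last component, and the stored checker selects the one that reproduces x.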

Theorem 3. [Universal polar compression] Let $X^n =
[X_1, \dots, X_n]$ be i.i.d. Bernoulli with $H(X_1) \leq R$. The
universal polar compression algorithm compresses $X^n$
at rate R, with error probability $O(2^{-n^{\beta}})$ for any $\beta < 1/2$,
and compression/reconstruction complexity $O(n \log_2 n)$.

(Note: Using the duality argument of Section III, this
theorem admits an analogue for universal coding over binary
symmetric channels.)

For the proof of this theorem, we show that:

1. Any binary source which is known to have entropy at most
R can be compressed universally with polar codes by storing
the information bits on $S(p^*)$, where $p^*$ is one of the two
distributions with entropy R.

2. If it is known on which symbol the source distribution puts
more mass, the source can also be losslessly reconstructed
with polar-dec using a checker.

3. If it is not known on which symbol the source distribution
puts more mass, the source can also be losslessly reconstructed
using the modified decoding algorithm polar-dec-adapt.

Proof of Theorem 3: Let $D(R) \subset M(2)$ be the set of
binary distributions with entropy at most R, and as before,
denote by $p_0(R)$ and $p_1(R)$ the two distributions of entropy
equal to R (assume R < 1, the result is otherwise trivial).
Note that, by Corollary 1, for i = 0, 1,

$$p_i(R) \prec_c D(R),$$

and by Lemma 3,

$$S_{\delta,n}(p_i(R)) \supseteq S_{\delta,n}(p), \quad \forall p \in D(R), \delta, n.$$

Hence, by storing the components on $S_{\delta,n}(p_0(R))$, we are not
losing any information bits. We have to set $\delta = \delta_n = 2^{-n^{\alpha}}$
with $\alpha < 1/2$ large enough to reach the desired $\beta$ in the
theorem.

Let $D_i(R)$, i = 0, 1, be the two regions of D(R) containing
distributions putting more mass on 0, respectively 1, assuming
consistent indexing with $p_0(R)$ and $p_1(R)$. Note that for i = 0, 1,

$$p_i(R) \prec_{cp} D_i(R).$$

Hence, if we know that the source distribution belongs
to $D_0(R)$, we can conclude from Lemma 4 that
polar-dec($p^*$, $u[S_{\delta,n}(p^*)]$, n) leads to an exact recovery,
with error probability at most equal to the error probability
of the source polar scheme designed with perfect knowledge
of the source distribution, which is, from [5], $O(2^{-n^{\beta}})$ for
any $\beta < 1/2$. From the same paper, we conclude that the
compression and reconstruction complexity is $O(n \log_2 n)$.

If we do not know whether the true distribution is in $D_0(R)$
or $D_1(R)$, we can learn it as follows. Assume that the type
of $X^n$ is close to its Bernoulli distribution; this is not the
case only with an exponentially small probability. Notice that the
observed data $U[S_{\delta,n}(p_0(R))]$ corresponds to an equally likely
string under both a distribution in $D_0(R)$ and $D_1(R)$, since the
distribution on $S_{\delta,n}(p_0(R))$ when $X^n$ is drawn under $p_0(R)$
or $p_1(R)$ is uniform. Say w.l.o.g. that the true distribution, $\hat{p}$,
is in $D_0(R)$. If we use $p_0(R)$ for polar-dec, we will get the
right output (modulo the error probability). If we use $p_1(R)$,
we will recover a word $\tilde{X}^n$ which is typical under $\bar{p}$, the measure
obtained from $\hat{p}$ by exchanging the probability mass at 0 and
1. To see this, note that for a typical $X^n$ under $\hat{p}$, $X^n + 1^n$ is
typical under $\bar{p}$ (where $1^n$ is the n-dimensional vector filled
with 1's). Moreover,

$$1^n G_n = [0^{n-1}, 1],$$

and the last component of $U^n$ cannot be in $S_{\delta,n}(p)$ (the last
component of $U^n$ is the one with least conditional entropy)
unless p is the uniform distribution. Hence $X^n$ and $X^n + 1^n$
are both typical upon observing $U[S_{\delta,n}(p_0(R))]$, and we must
have been able to recover correctly $X^n$ or $X^n + 1^n$ when
knowing if the true distribution was in $D_0(R)$ or $D_1(R)$ and
decoding with respectively $p_0(R)$ or $p_1(R)$. Hence, by storing
the value of $U[n]$ (even if it has low entropy) and running the
algorithm polar-dec-adapt with both $p_0(R)$ and $p_1(R)$,
we can check which one of the two models provides the
correct estimate for $U[n]$ and learn whether $\hat{p}$ is in $D_0(R)$ or
$D_1(R)$. Indeed, there is no need to run the algorithm twice;
it is sufficient to run it once and use the value of $U[n]$
which had been stored. In any case, we will make an error if
polar-dec fails, which happens from [5] with probability
$O(2^{-n^{\beta}})$ for any $\beta < 1/2$, and the complexity of this scheme
is $O(n \log_2 n)$. ■
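The identity $1^n G_n = [0^{n-1}, 1]$ used in the proof above, and its consequence that complementing every source bit flips only the last component of the transformed vector, can be checked directly. The sketch below uses the row-vector convention $u = x G_n$ over $\mathbb{F}_2$; the helper names are hypothetical.

```python
import numpy as np


def polar_matrix(n):
    """G_n = [[1, 0], [1, 1]] Kronecker-powered log2(n) times (n a power of 2)."""
    g2 = np.array([[1, 0], [1, 1]], dtype=int)
    g = np.array([[1]], dtype=int)
    while g.shape[0] < n:
        g = np.kron(g, g2)
    return g


def polar_transform(x):
    """u = x G_n over F_2 for a binary word x whose length is a power of 2."""
    x = np.asarray(x, dtype=int)
    return (x @ polar_matrix(len(x))) % 2


if __name__ == "__main__":
    n = 8
    print(polar_transform(np.ones(n, dtype=int)))   # all-ones word
    x = np.array([0, 1, 1, 0, 0, 0, 1, 0])
    print(polar_transform(x))
    print(polar_transform(1 - x))                    # complemented word
```

Since the transform is linear, the transforms of x and of its complement differ exactly by $1^n G_n$, so a single stored checker bit $U[n]$ distinguishes the two candidates.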

B. Results for a-ary sources

Definition 13. For $D \subseteq M(a)$, let

$$p_c(D) := \arg\min_{p \in M(a) :\, p \prec_c D} H(p), \qquad (23)$$
$$\bar{p}_c(D) := \arg\min_{p \in \bar{M}(a) :\, p \prec_c D} H(p). \qquad (24)$$

In any of the above minimizations, if the minimizer is not
unique, pick one arbitrarily.

We clearly have that $\bar{p}_c(D) \prec_h p_c(D)$; however, it is trivial
to find $\bar{p}_c(D)$ while finding $p_c(D)$ is more difficult.

Let us assume for instance that the exact source distribution
is unknown for the compression part, but is known for the
reconstruction part. If the source distribution is known to
belong to a set $D \subseteq M(a)$, one way to construct the
storage set is to pick $S(p_c(D))$. Then, from Lemma 3 and
Corollary 1, this retains the information bits of any source
in D. Of course, this may consume more rate than needed
with an optimal source code. In other words, if we define
$H_{\max}(D) := \max_{p \in D} H(p)$, we have in general

$$H(p_c(D)) \geq H_{\max}(D). \qquad (25)$$

The inequality can be strict since $p_1 \prec_h p_2$ does not imply
in general $p_1 \prec_c p_2$, and there are examples where equality
holds in the above, in which case a compression designed for
$p_c(D)$ requires the minimal rate to compress any source in D.

Remark 3. Let $D \subseteq M(a)$ be such that $\arg\max_{p \in D} H(p)$
is unique (denoted $p_h(D)$) and satisfies $p_h(D) \prec_c D$. Then
a source polar code designed for $p_h(D)$ can compress any
source in D at the lowest achievable rate $\max_{p \in D} H(p)$
without losing any information bits.

(Note that $p_h(D) \prec_c D$ implies that $p_c(D) = p_h(D)$.) The
set $\mathrm{DOM}_c(p)$, plotted in Figure 1, satisfies (by definition)
the condition of Remark 3 for any p. Comparing Figure 1
with the plot of $\mathrm{DOM}_h(p) := \{q \in M(a) : q \succ_h p\}$ also
shows that there are sets for which (25) holds with a strict
inequality. One can also check how much rate is lost by
compressing for a distribution that is dominated in terms of
$\prec_c$ as opposed to $\prec_h$, for example the gap between the rate
needed to compress $B_R := \{p \in M(a) : H(p) \leq R\}$ using
$p_c(B_R)$ and the minimal rate R needed with the method of
types, i.e., $H(p_c(B_R)) - R$. This gap can be computed using
Remark 2; in the case of Figure 1 for example, it is 0.095 for
R = 0.865 and a = 3.

Fig. 1. Plots of DOMh([0.2,0.2,0.6]) (red region) included in
DOMc([0.2,0.4,0.4]) (blue region).

Also note that $p_1 \nprec_c p_2$ may not imply $S(p_2) \nsubseteq S(p_1)$,
and there may be other ways to construct storage sets which
contain the information bits for several distributions (than
using $\prec_c$). We investigate this point in Section VI-E and now
move to the decoding part for a-ary sources.



Definition 14. For $D \subseteq M(a)$, let

$$p_{cp}(D) := \arg\min_{p \in M(a) :\, p \prec_{cp} D} H(p). \qquad (26)$$

If the minimizer is not unique, pick one arbitrarily.

The following follows from the definitions.

Lemma 5. A source distribution known to belong to a set $D \subseteq
M(a)$ can be compressed and reconstructed losslessly at rate
$H(p_{cp}(D))$, using a polar code designed for the distribution
$p_{cp}(D)$.

C. Non-universality of a-ary source polar codes 

In this section, we show that in general, polar codes cannot achieve the lowest rate for lossless compression of compound sources when $a \ge 3$, no matter how the storage sets are constructed (i.e., not necessarily via $\preceq_c$). A similar result has been derived in [15] for channel coding; however, it is not possible to transfer the counter-example found in [15] to the source case, since the channel polarization results have a source counterpart only for additive noise channels, and the counter-example in [15] does not use only additive noise channels. In this section, we assume that for the decoding part, we have the aid of a genie that provides the exact source distribution.



We consider two source distributions $p$ and $q$ on $\mathbb{F}_a$, and we are interested in finding the rates at which one can compress these two sources without losing the information bits of either of them. We denote by $C_{pol}(p, q)$ the infimum of these rates, and we provide different bounds on this quantity. Clearly,

$$C(p, q) := H(p) \vee H(q) \le C_{pol}(p, q).$$

From the previous section, we have the upper bound $C_{pol}(p, q) \le H(p_c(p, q))$, where $p_c(p, q)$ is as defined in (23) for the set $D = \{p, q\}$. In our definition, $C_{pol}(p, q)$ is given by the limit inferior of

$$\frac{1}{n} |S_\delta(p) \cup S_\delta(q)|.$$

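Let $n = 2^\ell$, $U^n = G_n X^n$ where $X^n$ is i.i.d. under $p$, and $V^n = G_n Y^n$ where $Y^n$ is i.i.d. under $q$. Let us also denote by $P$ (resp. $Q$) the additive noise channel whose noise distribution is $p$ (resp. $q$). We then have from Section III

$$H(U_i \mid U^{i-1}) = 1 - I(P_i), \qquad H(V_i \mid V^{i-1}) = 1 - I(Q_i), \tag{27}$$

where $P_i$ (resp. $Q_i$) are the channels corresponding to $P^\sigma$ (resp. $Q^\sigma$) for $\sigma \in \{-, +\}^\ell$, as defined in [3] with the tree construction. Moreover, if we define for $\delta \in (0, 1)$ the set $\mathcal{G}_\delta(P) = \{i \in \{1, \ldots, n\} : I(P_i) > \delta\}$, we have

$$S_\delta(p) \cup S_\delta(q) = (\mathcal{G}_\delta(P) \cap \mathcal{G}_\delta(Q))^c. \tag{28}$$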
This shows that the compound capacities for source and channel coding are related, and we can use the result of Section III and Theorem 5 in [15] to get the following bounds.



Lemma 6. 

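$$C_{pol}(p, q) \le \frac{1}{2^\ell} \sum_{\sigma \in \{-,+\}^\ell} I(\mathrm{BEC}(Z(P^\sigma))) \vee I(\mathrm{BEC}(Z(Q^\sigma))) \tag{29}$$

$$C_{pol}(p, q) \ge \frac{1}{2^\ell} \sum_{\sigma \in \{-,+\}^\ell} H(p^\sigma) \vee H(q^\sigma) \tag{30}$$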

where $P$ (resp. $Q$) is the additive noise channel with noise distribution $p$ (resp. $q$). Moreover, each bound approaches $C_{pol}(p, q)$ monotonically.

Note that the upper bound is straightforward, and the notation $H(p^\sigma)$ refers to $H(U_i \mid U^{i-1})$ for the index $i$ corresponding to $\sigma$. It is interesting to note that, although BECs can be used to compute the previous bounds, we cannot use the counter-example of [15] to show that polar codes do not achieve compound capacity in source coding, since BECs do not correspond to a valid source distribution via the duality of Section III. However, we can use the duality and BECs to construct storage sets which are included in $S_\delta(p) \cup S_\delta(q)$, in a different manner than in the previous section. Let us give an example with $\ell = 1$. For two source distributions $p$ and $q$, consider finding the BECs with parameters $Z(P)$ and $Z(Q)$ ($P$ and $Q$ as defined above). Then, as in [15], the good indices for $P$ and $Q$ satisfy

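$$\mathcal{G}(P) \cap \mathcal{G}(Q) \supseteq \mathcal{G}(\mathrm{BEC}(Z(P))) \cap \mathcal{G}(\mathrm{BEC}(Z(Q))) \tag{31}$$

$$= \mathcal{G}(\mathrm{BEC}(Z(P) \vee Z(Q))) \tag{32}$$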



and from (28), $\mathcal{G}(\mathrm{BEC}(Z(P) \vee Z(Q)))$ gives a storage set to compress $p$ and $q$ without losing information bits. This provides an interesting and different approach to constructing universal polar codes, although it may not be practical and has the drawback of requiring the source distribution for the reconstruction (as opposed to the $\preceq_{cp}$ ordering). In a work in progress, we propose the use of spike measures in $M(a)$ to replace the "worst BECs" directly with "worst source distributions". The common feature of the spike measures and BECs is that both families have a nested structure for the storage/good index sets and span the whole range of entropy/mutual information between 0 and 1. Also note that, as opposed to the channel polarization case, degradedness in source polarization is less restrictive, since there are fewer degrees of freedom for source distributions than for channels.

Now, to show that polar codes do not achieve the compound capacity in source coding, we can still use the lower bound of Lemma 6, but we need to pick two source distributions on ternary source alphabets.

Proposition 1. Polar codes do not achieve the compound 
capacity for source coding when the source alphabet has 
strictly more than 2 elements. 

Counter-example: Let $p = [0.08, 0.36, 0.56]$ and $q = [0.11, 0.62, 0.27]$, so that $H(p) = 0.8143$, $H(q) = 0.8126$ and $C = H(p) \vee H(q) = 0.8143$. The lower bound of Lemma 6 for $\ell = 1$ evaluates to 0.8174, which is strictly larger than $C$.
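
The counter-example can be checked numerically with a short sketch. It assumes that one polarization step maps $(X_1, X_2)$ to $U_1 = X_1 \pm X_2$ (mod $a$) and $U_2 = X_2$, with entropies taken in base $a$; the function names and the `sign` parameter are hypothetical and only serve the illustration. Both sign conventions give a value of roughly 0.817, above $C \approx 0.8143$.

```python
import math

def entropy(p, base):
    """Shannon entropy of the pmf p, in the given logarithm base."""
    return -sum(x * math.log(x, base) for x in p if x > 0)

def combine(p, sign=+1):
    """pmf of X1 + sign * X2 (mod a) for X1, X2 i.i.d. under p."""
    a = len(p)
    out = [0.0] * a
    for i in range(a):
        for j in range(a):
            out[(i + sign * j) % a] += p[i] * p[j]
    return out

def one_step_entropies(p, sign=+1):
    """(H(p^-), H(p^+)) after one polarization step, in base a = len(p)."""
    a = len(p)
    h_minus = entropy(combine(p, sign), a)   # H(U_1)
    h_plus = 2 * entropy(p, a) - h_minus     # H(U_2 | U_1), by the chain rule
    return h_minus, h_plus

def lower_bound_one_step(p, q, sign=+1):
    """(1/2) * sum over sigma in {-, +} of max(H(p^sigma), H(q^sigma))."""
    hp, hq = one_step_entropies(p, sign), one_step_entropies(q, sign)
    return 0.5 * sum(max(x, y) for x, y in zip(hp, hq))

p = [0.08, 0.36, 0.56]
q = [0.11, 0.62, 0.27]
C = max(entropy(p, 3), entropy(q, 3))
bound = lower_bound_one_step(p, q, sign=-1)  # sign convention is an assumption
print(round(C, 4), round(bound, 4))
```

The bound exceeds $C$ because the "worse" source differs between the two branches: $p$ has the larger entropy on the minus branch, $q$ on the plus branch.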

VI. Sketching and sparse recovery 

In compressed sensing (CS), a $k$-sparse signal of high dimensionality $n$ can be recovered with overwhelming probability from a small number of random measurements $m = O(k \log(n/k))$ with a convex optimization method Q, [11]. While the use of random measurement matrices simplifies the mathematical analysis, a drawback is that they have a heavy structure and there is no efficient way to check whether a given matrix realization satisfies the desired property (RIP) for the reconstruction (although one can show that this happens with high probability). Other drawbacks of random sensing matrices are discussed in f\n\. It has hence become a challenging problem to construct explicit matrices that can nevertheless perform competitively (in terms of measurement rate) with the random ones. Different deterministic matrices have been proposed in the literature, but in [8], [22], [10] the number of rows is at least quadratic in $k$, and in [9] one needs $\Omega(n)$ bits to specify a matrix entry. In [11], binary matrices with $m$ brought down to $k 2^{O((\log\log n)^{\delta})} = k n^{o(1)}$, with $\delta > 1$, are proposed, and in lISl, a rather general condition for constructing deterministic matrices satisfying a statistical restricted isometry property (StRIP) is given.

In this section, we are interested in designing an explicit measurement matrix using the polarization technique. The motivation is that the matrix used in the previous section for polar source compression is deterministic and easily constructed. Of course, in the basic source compression problem, the compression matrix is designed adaptively to the source distribution, whereas the CS results are universal as long as the signal is sparse. Hence, we would like to construct an explicit matrix with the polarization technique that is also universal. The tools of the previous section for universal polarization will hence be used.

Note that there are a few more distinctions between CS and the problems of the previous sections. First, the source in our case is a random process, whereas in the original works on compressed sensing, the signal is deterministic. The case of random signals has been considered in several subsequent works on compressed sensing, such as [6]. Another important difference is that the source in our setting is valued in $\mathbb{F}_a$, as opposed to $\mathbb{R}$ for arbitrary sparse signals. One way to address this problem is via quantization, which requires a careful treatment. It is related to the fact that in the CS setting, measurements can be made with arbitrary precision, whereas in our setting they are quantized in bits. On the other hand, we focus here on applications where the signal
is valued in a discrete set to start with, such as in certain 
network monitoring problems [12], [ISj . For example, if one 
wishes to track the number of packets flowing between the 
different IP addresses of a network (e.g. to detect unusual 
behaviors), the state vector can be of dimension up to 2^^. 
Since it is not feasible to maintain such a huge dimensional 
vector, one wishes to use a much smaller sketch vector that is 
still carrying all the significant information of the state vector, 
by exploiting the fact that the state vector is sparse. We hence 
keep such applications as our motivation and focus mainly on 
the sketching (sensing) and sparse recovery of such discrete 
signals. A possible lifting of the results in this paper to the 
real field setting is investigated in a work in progress. 

Before attacking the problem of constructing a universal 
deterministic sketching matrix via the polarization technique, 
we consider a specific example by choosing a particular 
distribution for the signal to get started. 

A. Assuming knowledge of the signal distribution 

In this section, we assume that the distribution of the signal is known. The case of unknown distributions is discussed in Section VI-B. Assume that $X_1, \ldots, X_n$ are i.i.d. under the following spike distribution:

$$p_\varepsilon := (1 - \varepsilon, \varepsilon/(a-1), \ldots, \varepsilon/(a-1)) \in M_0(a). \tag{33}$$

Note that for $n$ i.i.d. samples drawn under $p_\varepsilon$, the expected number of non-zero components is $n\varepsilon$.

Definition 15 (Polar sketching matrix for a single distribution). Let $\delta \in (0, 1)$ and let $\phi_{\delta,n}(p_\varepsilon) = I_S \cdot G_n$ be the matrix obtained by deleting the rows of $G_n$ which are not indexed by $S = S_{\delta,n}(p_\varepsilon)$ (cf. Definition 8).

Rephrasing the source polarization result, we obtain the 
following. 

Lemma 7. Let $n$ be a power of 2, $X$ be an $n$-dimensional vector drawn i.i.d. under $p_\varepsilon$, and let $\phi = \phi_{\delta_n,n}(p_\varepsilon)$ be the polar sketching matrix defined for $p_\varepsilon$ and $\delta_n = 2^{-n^\alpha}$ with $\alpha \in (0, 1/2)$ (cf. Definition 15). For any $\alpha \in (0, 1/2)$, there exists $\beta \in (0, 1/2)$ such that the number of rows of $\phi$ is given by

$$m = n(1 - \varepsilon) \log_a\!\left(\frac{1}{1 - \varepsilon}\right) + n\varepsilon \log_a\!\left(\frac{a-1}{\varepsilon}\right) + o(n),$$

and using the polar decoding algorithm for $p_\varepsilon$, we can recover $X$ from $Y = \phi X$ with error probability $O(2^{-n^\beta})$ and with a complexity bounded as $O(a^2 n \log_2 n)$ (and if $a$ is a power of 2, the complexity can be reduced to $O(a \log_2 a \cdot n \log_2 n)$ by using the approach in [2]).

Discussion: Note that $m$ is simply the cardinality of $S$, which is approximately $nH(p_\varepsilon)$. Defining $n\varepsilon = k$, we have

$$m = k \log_a \frac{n}{k} + o\!\left(\log_a \frac{n}{k}\right) \tag{34}$$

$$= \frac{1}{\log_e a}\, k \log_e \frac{n}{k} + o\!\left(\log_e \frac{n}{k}\right). \tag{35}$$

This expression is similar to the $O(k \log_e \frac{n}{k})$ expression encountered in the CS literature ([7]) for the number of measurements. It is even a tighter form since the constant is less than 2; of course, for the reasons discussed at the beginning of Section VI, the comparison is inappropriate, since (in particular) we are modifying the assumption on the signal: it is drawn from a specific known distribution. The fact that the number of measurements decreases when $a$ increases may seem strange; however, notice that a measurement for signals in $\mathbb{F}_a$ is made with a precision of $a$ bits. Hence, to compare the number of measurements for different values of $a$, one should use the same unit for the measurement. Let us check how the number of measurements scales with $a$. Rewriting (34) with the dependency on $a$, we have

$$m = k(1 + \log_a(a-1)) + k \log_a \frac{n}{k} = 2k + o_a(1). \tag{36}$$

Hence, if we allow infinite precision for the measurements, for large $a$ we only need $2k$ measurements, but of course, the complexity blows up. If we express all measurements in nats, we have

$$m = k(1 + \log_e(a-1)) + k \log_e \frac{n}{k}, \tag{37}$$

and $m$, as a function of $a$, grows like $k \log_e a$.
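
To get a feel for these quantities, the following sketch evaluates the first-order row count $nH(p_\varepsilon)$ (entropy in base $a$, with the $o(n)$ correction ignored) next to the leading term $k \log_a(n/k)$. The helper names and the sizes $n$, $\varepsilon$ are hypothetical values picked for illustration.

```python
import math

def spike_entropy(eps, a):
    """Entropy (base a) of p_eps = (1 - eps, eps/(a-1), ..., eps/(a-1))."""
    h = -(1 - eps) * math.log(1 - eps, a)
    h += -eps * math.log(eps / (a - 1), a)
    return h

def approx_rows(n, eps, a):
    """First-order row count n * H(p_eps), ignoring the o(n) correction."""
    return n * spike_entropy(eps, a)

n, eps = 2 ** 20, 2 ** -10   # hypothetical sizes, so that k = n * eps = 1024
k = n * eps
for a in (2, 3, 5):
    m = approx_rows(n, eps, a)
    leading = k * math.log(n / k, a)
    print(a, round(m), round(leading))
```

Counted in $a$-ary symbols, the row count shrinks as $a$ grows, which is the unit effect discussed above.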

Of course, Lemma 7 requires knowledge of the signal distribution, whereas CS results are universal. There is no reason to assume that (33) is the distribution of the signal. Exact knowledge of the source distribution is in general unrealistic, and as discussed in the previous section, even with an estimate of the distribution, it is crucial to show at least some robustness with respect to possible mismatched distributions. For applications, it may actually be interesting to have adaptive results, but this also changes the rules of the game. We now investigate the universality problem.




Fig. 2. The simplex with $\mathrm{Spa}(3, \varepsilon)$ (lower triangle, in red) for $\varepsilon = 1/5$ and the spike measures at 0, namely $M_0(3)$ (middle line, in blue).



B. Universal prior 

A possible way of defining $k$-sparse random sources is to ask that the source distribution leads to an expected number of at most $k$ non-zero values. Specifically, let $a$ be a prime number and let $\mathbb{F}_a = \{0, \ldots, a-1\}$. Let $X^n = (X_1, \ldots, X_n)$ be i.i.d. samples from a distribution $\mu$, with $\mu(0) = 1 - \varepsilon$. Then, the number $K(X^n)$ of components of $X^n$ which are not equal to 0 is in expectation

$$\mathbb{E} K(X^n) = n\varepsilon. \tag{38}$$

Let

$$\mathrm{Spa}(a, \varepsilon) := \{\mu \in M(a) : \mu(0) \ge 1 - \varepsilon\}, \tag{39}$$

and consider samples $X^n = (X_1, \ldots, X_n)$ that are i.i.d. from a distribution in $\mathrm{Spa}(a, \varepsilon)$. From the previous remark, the expected number of components in $X^n$ that are not equal to 0 is bounded by $n\varepsilon$. If we are interested in $\varepsilon$ as a measure of sparsity, then $K(X^n)/n$ concentrates exponentially fast around $\varepsilon$. The set $\mathrm{Spa}(a, \varepsilon)$ is pictured in Figure 2 for $a = 3$.

The results that we will derive do not depend on the fact that 0 is the special value for the distributions in $\mathrm{Spa}(a, \varepsilon)$; in other words, we could equally well consider sources that are sparse with respect to an arbitrary $i \in \mathbb{F}_a$. For simplicity, we stick with $i = 0$ for now, although considering arbitrary $i$'s may be useful when dealing with the problem of quantizing a signal to $\mathbb{F}_q$. Also note that $\mathrm{Spa}(a, \varepsilon)$ contains sources which can be supported on any subset of $\mathbb{F}_a$ (e.g., $a$ may be large but this set still contains sparse binary sources). It may be reasonable to assume that there is no such variation in the probability mass assigned to the non-special values; this will be discussed later.

Remark 4. From an information-theoretic point of view, we can ask the question of finding the smallest rate at which one could compress a source whose distribution is in $\mathrm{Spa}(a, \varepsilon)$ without any further knowledge of the distribution (and irrespective of the compression scheme employed). As discussed in Section V, the answer to this question is given by the maximal entropy that can be reached with a distribution in $\mathrm{Spa}(a, \varepsilon)$. It turns out that the distribution with maximal entropy is precisely (33), as in Section VI-A. Hence, theoretically, the strong performance presented in Section VI-A can still hold in the universal setting. However, the scheme used to achieve such Shannon-limit performance may be highly complex, whereas the whole point here is to consider explicit schemes of low complexity.

Theorem 4. Let $X^n$, with $n$ a power of 2, be an $n$-sample drawn i.i.d. from a distribution which has at most $\varepsilon$ mass on the non-zero entries of $\mathbb{F}_a$, and let $\phi(a, \varepsilon)$ be the $m \times n$ polar sketching matrix constructed deterministically for $a$ and $\varepsilon$ (cf. Definition 16). We have

$$m = C(a, \varepsilon) \cdot k \log_e \frac{n}{k} + O(k), \quad k = n\varepsilon,$$

with

$$\lim_{\varepsilon \to 0} C(a, \varepsilon) = \frac{a-1}{\log_e a},$$

and with probability $1 - O(2^{-n^\beta})$, $\beta \in (0, 1/2)$, $X^n$ can be exactly reconstructed from $\phi X^n$ using the polar decoding algorithm (cf. Remark 5) with a complexity of $O(a^2 n \log_2 n)$ (or $O(a \log_2 a \cdot n \log_2 n)$ if $a$ is a power of 2 and [2] is used).

Remark 5.

1. The polar decoding algorithm (cf. Definition 9) must be evaluated as

polar-dec($p_{cp}(\mathrm{Spa}(a, \varepsilon))$, $\phi x^n[S_{\delta,n}(p_{cp}(\mathrm{Spa}(a, \varepsilon)))]$, $n$)

for $\delta = \delta_n = 2^{-n^\alpha}$ with $\alpha < 1/2$ large enough to reach the desired $\beta$ in the Theorem, and where $p_{cp}(\mathrm{Spa}(a, \varepsilon))$ is the distribution of minimal entropy that is dominated by the entire set $\mathrm{Spa}(a, \varepsilon)$ for $\preceq_{cp}$, as in (26), and $S$ is as in Definition 8.

2. The multiplication is carried out over $\mathbb{F}_q$.

3. The same result holds if the distribution of $X^n$ has at most $\varepsilon$ mass outside an arbitrary $i \in \mathbb{F}_a$.

4. In Section VI-E we discuss improvements of the constant $C$.

Definition 16. Given a set $D$ of probability measures on $\mathbb{F}_a$, we construct a sketching matrix $\phi(D)$ of dimension $n$ as follows:

(i) Find $p_{cp}(D)$ as defined in (26).

(ii) Find $S = S_{\delta,n}(p_{cp}(D))$ as in Definition 8 for $0 < \delta < 1$.

(iii) Define $\phi = I_S G_n$, where $G_n = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}^{\otimes \log_2 n}$ and where $I_S$ is the matrix whose columns indexed by $S$ form the identity matrix and whose other columns are filled in with zeros. Note that $\phi$ is an $m \times n$ matrix, where $m = |S|$.

In particular, we define $\phi(a, \varepsilon) := \phi(\mathrm{Spa}(a, \varepsilon))$, and to have the optimal error decay we pick $\delta = 2^{-n^\alpha}$ with $\alpha < 1/2$.

Implementation of $\phi$.

1. Step (i) can be easily computed, cf. Remark 2 and the proof below.

2. Step (ii) requires a comment: finding S with an analytic 



formula is a hard open problem in polar codes. However, it is 
mostly a mathematical challenge, since one can run simula- 
tions to determine S with a good accuracy, or find arbitrarily 
tight bounds on the entropy terms in S in polynomial time 

m. 

3. The construction of $G_n$ is straightforward because of its Kronecker structure, which also allows an efficient decoding algorithm running in $O(n \log_2 n)$.
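
Step (iii) can be sketched in a few lines. The sketch assumes the $2 \times 2$ kernel $[[1, 1], [0, 1]]$ acting on column vectors, with no bit-reversal permutation, and it takes the storage set $S$ as given (step (ii) is not implemented). The function names and the example set are hypothetical.

```python
import numpy as np

def polar_matrix(n):
    """Kronecker power of the 2x2 kernel [[1, 1], [0, 1]]; n must be a power of 2."""
    kernel = np.array([[1, 1], [0, 1]], dtype=np.int64)
    G = np.array([[1]], dtype=np.int64)
    while G.shape[0] < n:
        G = np.kron(G, kernel)
    if G.shape[0] != n:
        raise ValueError("n must be a power of 2")
    return G

def sketching_matrix(n, storage_set):
    """phi = I_S G_n: keep only the rows of G_n indexed by the storage set S."""
    rows = sorted(storage_set)
    return polar_matrix(n)[rows, :]

def sketch(phi, x, a):
    """Measurement y = phi x, with the multiplication carried out mod a."""
    return (phi @ np.asarray(x, dtype=np.int64)) % a

S = {0, 1, 2}                       # hypothetical storage set, 0-based row indices
phi = sketching_matrix(4, S)
y = sketch(phi, [0, 2, 0, 1], a=3)
```

Building the dense matrix costs $O(n^2)$ memory; it is done here only for readability, since the Kronecker structure allows the product to be computed recursively.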

C. Interpretation of Theorem 4

In view of

$$\lim_{\varepsilon \to 0} C(a, \varepsilon) = \frac{a-1}{\log_e(a)},$$

for a fixed small $\varepsilon$, the quantization level $a$ should be at most $1/\varepsilon$ in order to have a dimensionality reduction. Hence, if a signal is sparser in its domain than in its magnitude, where we define the magnitude-sparsity of a signal taking $a$ possible values by $1/(a-1)$, and the domain-sparsity as before by $\varepsilon = k/n$, then the approach of Theorem 4 gives interesting results.

For $a$ small, this is interesting for most $\varepsilon$. In particular, for $a = 2$, the sparsity in magnitude is maximal, namely 1, and for any $\varepsilon$ we have an optimal dimensionality reduction

$$m = 1.44 \cdot k \log_e(n/k),$$

where the optimality refers here not only to the order $k \log_e(n/k)$ but also to the constant 1.44 (recall that the measurements are taken in bits). By Shannon, one cannot further improve this bound (even with schemes of high complexity).

For $a$ large, e.g. $a = 257$, we get reasonable dimensionality
reduction for very sparse data, for example, if e = 10"^ 
and n — 10^, we get a reduction of 68% for the number of 
measurements (compared to $n$). But for $a = 257$ and $\varepsilon = 0.1$, there is almost no dimensionality reduction. However, we will see in the next section that this is due to the analysis employed in the proof of Theorem 4 rather than the use of the polar matrix.
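
The orders of magnitude quoted here can be reproduced from the leading term alone. The sketch below evaluates the limiting constant $(a-1)/\log_e a$ and the resulting leading-order rate $m/n \approx C \cdot \varepsilon \log_e(1/\varepsilon)$, dropping the $O(k)$ term. The value $\varepsilon = 10^{-3}$ is an assumed value chosen for illustration; it gives a rate of roughly 0.32, in line with the 68% reduction quoted above.

```python
import math

def limiting_constant(a):
    """Small-eps limit of C(a, eps): (a - 1) / log_e(a)."""
    return (a - 1) / math.log(a)

def leading_rate(a, eps):
    """Leading term of m / n, i.e. C * eps * log_e(1 / eps), with O(k) terms dropped."""
    return limiting_constant(a) * eps * math.log(1 / eps)

print(round(limiting_constant(2), 2))      # constant for binary signals
print(round(leading_rate(257, 1e-3), 3))   # eps = 1e-3 is an assumed value
print(round(leading_rate(257, 0.1), 3))    # exceeds 1: no reduction from this bound
```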

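D. Proof of Theorem 4

Proof. The number of measurements $m$ is given by $nH(p_{cp}(\mathrm{Spa}(a, \varepsilon))) + o(n)$. Note that by symmetry, $p_{cp}(\mathrm{Spa}(a, \varepsilon))$ is a spike measure (i.e., an element of $M_0(a)$). We have (with $q = a$)

$$p_{cp}(\mathrm{Spa}(a, \varepsilon)) = (1 - \eta(\varepsilon), \eta(\varepsilon)/(q-1), \ldots, \eta(\varepsilon)/(q-1)),$$

where $\eta(\varepsilon)$ is the smallest positive $\eta$ ensuring

$$(1 - \eta, \eta/(q-1), \ldots, \eta/(q-1)) \preceq_{cp} p$$

for any $p \in \mathrm{Spa}(a, \varepsilon)$. Moreover, it is sufficient to check

$$(1 - \eta, \eta/(q-1), \ldots, \eta/(q-1)) \preceq_{cp} (1 - \varepsilon, \varepsilon, 0, \ldots, 0),$$

i.e.,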

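$$\mathcal{F}^{-1}\!\left[ \left( \frac{\mathcal{F}(1 - \eta, \eta/(q-1), \ldots, \eta/(q-1))}{\mathcal{F}(1 - \varepsilon, \varepsilon, 0, \ldots, 0)} \right)^{1/k} \right] \ge 0$$

for any $k \ge 1$. Using (42), the dependence on $k$ can also be removed. Defining

$$z := \frac{\mathcal{F}(1 - \eta, \eta/(q-1), \ldots, \eta/(q-1))}{\mathcal{F}(1 - \varepsilon, \varepsilon, 0, \ldots, 0)} \tag{40}$$

$$= \frac{\mathcal{F}(1 - \eta, \eta/(q-1), \ldots, \eta/(q-1))}{\left(1 - \varepsilon + \varepsilon e^{-2\pi i j/a}\right)_{j=0}^{a-1}}, \tag{41}$$

and denoting the components of $z$ by $z_j = r_j e^{i\theta_j}$, we need to ensure

$$y(1), \ldots, y(a-1) \ge 0, \quad \text{where } y = \mathcal{F}^{-1}(\log r_j + i\theta_j). \tag{42}$$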

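Numerically, one can then easily find $\eta(\varepsilon)$ by means of the FFT algorithm. In Figure 4, we have plotted $\varepsilon \mapsto \eta(\varepsilon)$ for different values of $a$. Note that one can also find $\eta(\varepsilon)$ analytically using the following approach. Assume $a = 3$. Let us first find $p_c(\mathrm{Spa}(3, \varepsilon))$. Here also, we have $p_c(\mathrm{Spa}(3, \varepsilon)) = (1 - \tilde{\eta}(\varepsilon), \tilde{\eta}(\varepsilon)/2, \tilde{\eta}(\varepsilon)/2)$, where we need to find $\tilde{\eta}(\varepsilon)$. Note that all distributions that are worse than $(1 - \varepsilon, \varepsilon, 0)$ for $\preceq_c$ are given by the convex hull of the orbit of $(1 - \varepsilon, \varepsilon, 0)$ through cyclic shifts, that is, $\mathrm{hull}((1 - \varepsilon, \varepsilon, 0), (0, 1 - \varepsilon, \varepsilon), (\varepsilon, 0, 1 - \varepsilon))$. Hence, the projection $p_c$ of $(1 - \varepsilon, \varepsilon, 0)$, i.e., the distribution in this convex hull which belongs to the spike measures and has minimal entropy, is found by taking the intersection between the line connecting $(1 - \varepsilon, \varepsilon, 0)$ to $(\varepsilon, 0, 1 - \varepsilon)$ and the line of spike measures parametrized by $(1 - d, d/2, d/2)$. An elementary computation yields

$$1 - \tilde{\eta}(\varepsilon) = 1 - 2\varepsilon(1 - \varepsilon) = 1 - 2\varepsilon + o(\varepsilon). \tag{43}$$

Note that the scaling $1 - 2\varepsilon + o(\varepsilon)$ is clear, since for small $\varepsilon$ the line connecting $(1 - \varepsilon, \varepsilon, 0)$ to $(\varepsilon, 0, 1 - \varepsilon)$ is almost parallel to the line connecting $(1, 0, 0)$ to $(0, 0, 1)$. Indeed, one can easily generalize this for $a > 3$ to

$$1 - \tilde{\eta}_a(\varepsilon) = 1 - (a-1)\varepsilon + o(\varepsilon). \tag{44}$$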



To find the projection $p_{cp}$ of $(1 - \varepsilon, \varepsilon, 0)$, we need to move from $(1 - \varepsilon, \varepsilon, 0)$ towards spike measures with tiny convolutional steps. But once we have made a small step in the direction $(1 - \varepsilon, \varepsilon, 0) - (\varepsilon, 0, 1 - \varepsilon)$ to reach $(x, y, z)$, we need to move next in the rotated picture, i.e., in the direction $(x, y, z) - (y, z, x)$, as illustrated in Figure 3. Defining $f(x) = x + \gamma(\Pi - I)x$, where $\Pi = \mathrm{circ}(0, 1, 0)$, we are interested in $f^k(x)$ where $x = (1 - \varepsilon, \varepsilon, 0)$. Hence, we look for $A^k$ where $A = I + \gamma(\Pi - I)$. Since $\Pi$ is circulant, so is $A$; the eigenvectors of $A$ are the Fourier (DFT) basis elements and the eigenvalues are $1 + \gamma(\lambda_i - 1)$, where $\lambda_i$ are the corresponding 3 roots of unity. Therefore, the eigenvalues of $A^k$ are $[1 + \gamma(\lambda_i - 1)]^k$, and keeping $\gamma k = c$, we obtain

$$\lim_{k \to \infty} \left[1 + \frac{c}{k}(\lambda_i - 1)\right]^k = \exp(c(\lambda_i - 1)).$$

Hence,

$$\Gamma(c) = F_3 \,\mathrm{diag}(\exp(c(\lambda_i - 1)))\, F_3^* (1 - \varepsilon, \varepsilon, 0)^t,$$

for $c > 0$, and where $F_3$ is the Fourier (DFT) matrix of dimension 3, parametrizes the path starting at $(1 - \varepsilon, \varepsilon, 0)$




Fig. 3. The simplex with the $p_c$ projection (first point, in red) and the $p_{cp}$ projection (second point, in blue) of $[1 - \varepsilon, 0, \varepsilon]$.



and obtained with incremental convolutional steps which are "targeting" spike measures. Equating the second and third components of $\Gamma(c)$, i.e., solving $\Gamma(c)_2 = \Gamma(c)_3$, gives a closed-form expression for $c$ and for $\eta(\varepsilon) = 2\Gamma(c)_2$, and we get, as for $\tilde{\eta}(\varepsilon)$,

$$\eta(\varepsilon) = 2\varepsilon + o(\varepsilon). \tag{45}$$

That is, for small $\varepsilon$, the penalty incurred by considering the $p_{cp}$ projection rather than the $p_c$ one (for Spa) is negligible. This is not surprising, since for $\varepsilon$ small, the path from $(1 - \varepsilon, \varepsilon, 0)$ to spike measures is anyway short (as required for the $p_{cp}$ projection). With similar arguments, we conclude that for any $a$,

$$\eta_a(\varepsilon) = (a-1)\varepsilon + o(\varepsilon). \tag{46}$$
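
For $a = 3$, the path described above can be traced numerically. The sketch below is one possible reading of the construction: it evaluates $\exp(c(\Pi - I))$ applied to $(1 - \varepsilon, \varepsilon, 0)$ through the DFT, with $\Pi$ taken as the cyclic shift $(x, y, z) \mapsto (y, z, x)$, and it replaces the closed-form solution by a bisection on $c$. The function names, the bracket `c_max` and the iteration count are assumptions made for the illustration.

```python
import numpy as np

def path_point(eps, c):
    """Gamma(c) = exp(c (Pi - I)) x for x = (1 - eps, eps, 0), via the DFT."""
    x = np.array([1.0 - eps, eps, 0.0])
    g = np.array([0.0, 0.0, 1.0])      # circular kernel of Pi: (x, y, z) -> (y, z, x)
    lam = np.fft.fft(g)                # eigenvalues of Pi (the 3 roots of unity)
    return np.real(np.fft.ifft(np.exp(c * (lam - 1.0)) * np.fft.fft(x)))

def eta_cp(eps, c_max=1.0, iters=100):
    """Bisection on c for Gamma(c)_2 = Gamma(c)_3; returns (eta, c)."""
    gap = lambda c: path_point(eps, c)[1] - path_point(eps, c)[2]
    lo, hi = 0.0, c_max               # gap(0) = eps > 0; gap(c_max) < 0 is assumed
    if gap(hi) >= 0:
        raise ValueError("no sign change on [0, c_max]; try a smaller eps")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    return 2.0 * path_point(eps, c)[1], c

eta, c = eta_cp(0.01)
```

For small $\varepsilon$ the returned value stays close to $2\varepsilon$ and slightly above $2\varepsilon(1 - \varepsilon)$, the value obtained for the $p_c$ projection.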
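Finally, we need to evaluate $H(p_{\eta(\varepsilon)})$, where

$$p_{\eta(\varepsilon)} = (1 - \eta(\varepsilon), \eta(\varepsilon)/(a-1), \ldots, \eta(\varepsilon)/(a-1)).$$

Note that we can compare the cost of universality for a low-complexity scheme (obtained with the $p_{cp}$ analysis and the polar matrix) with the limiting performance (Shannon): instead of $nH(p_\varepsilon)$ we need $nH(p_{(a-1)\varepsilon})$ measurements when $\varepsilon$ is small. For $a = 2$, these two are identical, and this is consistent with Theorem 3. For arbitrary $a$, we get

$$H(p_{(a-1)\varepsilon}) = \frac{a-1}{\log_e a}\, \varepsilon \log_e \frac{1}{\varepsilon} + O(\varepsilon), \tag{47}$$

as opposed to

$$H(p_\varepsilon) = \frac{1}{\log_e a}\, \varepsilon \log_e \frac{1}{\varepsilon} + O(\varepsilon). \tag{48}$$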
Fig. 4. Plots of $\varepsilon \mapsto \eta(\varepsilon)$ for different values of $a = q$. Note that $(a-1)\varepsilon$ is an upper bound to $\eta(\varepsilon)$, tight for small $\varepsilon$ as shown in the proof of Theorem 4. Hence, for $\varepsilon$ not too small, $\eta(\varepsilon)$ provides a better measurement rate than what is obtained with the crude bound of Theorem 4.

Fig. 5. Plots of $\varepsilon \mapsto \eta^*(\varepsilon)$ for different values of $a = q$.



E. Improving Theorem 4

In Remark 4, we concluded that the minimal number of measurements (in $a$-ary bits) needed to recover a source from $\mathrm{Spa}(a, \varepsilon)$ is given by $nH(p_\varepsilon)$, where $p_\varepsilon$ is the spike measure with mass $1 - \varepsilon$ at 0 and $\varepsilon/(a-1)$ elsewhere. Approaching this performance with a polarization scheme of low complexity is challenging. We have shown that, for the universal problem of recovering sources from any distribution in $\mathrm{Spa}(a, \varepsilon)$, one could still use the adaptive setting but designed for a specific distribution, namely $p_{cp}(\mathrm{Spa}(a, \varepsilon))$. This distribution is dominated by the entire set $\mathrm{Spa}(a, \varepsilon)$ with respect to $\preceq_{cp}$, and we showed that because of this property, a scheme designed for $p_{cp}(\mathrm{Spa}(a, \varepsilon))$ guarantees successful recovery for any distribution in $\mathrm{Spa}(a, \varepsilon)$. In a sense, $p_{cp}(\mathrm{Spa}(a, \varepsilon))$ is the worst-case scenario. Of course, there may be other ways (than using $\preceq_{cp}$) to order distributions and find a worst-case distribution which has lower entropy (one may attempt to replace $\preceq_c$ with $\preceq_d$). There may even be other ways of tackling the universal problem than looking for an ordering and a worst-case distribution. The advantage of using a worst-case approach is that we can then inherit the complexity attributes and convergence rate property from the adaptive setting. An ordering also provides a 'hierarchy' between the different distributions, and helps in designing robust schemes (by backing off from the estimated distribution to guarantee performance).

One point is that it may not be necessary to consider the entire set $\mathrm{Spa}(a, \varepsilon)$ for a given problem. For example, the set of distributions that are dominated by $p_{cp}(\mathrm{Spa}(a, \varepsilon))$ (w.r.t. $\preceq_{cp}$) is already quite large and contains most distributions of sparsity $\varepsilon$. It does not contain the distributions of sparsity $\varepsilon$ which have small supports, but if these can be ruled out, then we can bring back the constant $C(a, \varepsilon)$ to $1/\log_e(a)$, which is the Shannon limiting performance.

We now discuss another approach to construct a universal sketching and reconstruction method for $\mathrm{Spa}(a, \varepsilon)$. As before, there are two parts to discuss. First the sketching, i.e., knowing which rows of $G_n$ can be deleted without losing information about $X^n$. Then the reconstruction, i.e., knowing how to run a decoding algorithm that ignores the exact distribution.

1) Sketching: Here is a brute-force approach to achieve universal sketching.

Definition 17 (brut-univ-sketching algorithm).
Inputs: $\varepsilon$ (the sparsity degree), $a$ (the size of the signal alphabet), $n$ (the dimension).
Outputs: $\eta^*(\varepsilon)$.

We present two variants of the algorithm.
Variant A:

For $\eta$ from $\varepsilon$ until $\eta(\varepsilon)$ (with a given step size);
if $S(p_\eta) \supseteq S(q)$ for any $q$ in the convex hull of $(1 - \varepsilon, \varepsilon, 0, \ldots, 0), (1 - \varepsilon, 0, \varepsilon, 0, \ldots, 0), \ldots, (1 - \varepsilon, 0, \ldots, 0, \varepsilon)$;
output $\eta$;
otherwise increase $\eta$ by the step size.
Variant B:

For $\eta$ from $\varepsilon$ until $\eta(\varepsilon)$ (with a given step size);
if $S(p_\eta) \supseteq S(q^{(\varepsilon)})$ where $q^{(\varepsilon)} = (1 - \varepsilon, \varepsilon, 0, \ldots, 0)$;
output $\eta$;
otherwise increase $\eta$ by the step size.

Note: one could also consider a dichotomic (bisection) approach for the search of $\eta^*(\varepsilon)$, and by symmetry, one can restrict the search of $q$ to only one portion of the convex hull. One also has to specify the precision $\delta$ for the computations of the sets $S = S_{\delta,n}$; we omitted it in the algorithm to simplify the notation.

Variant B has low complexity, since it conducts a search in a one-dimensional space (for $\eta$) and since the computation of $S$ can be done at low computational cost. Variant A requires a larger search for $q$, which can be constraining for $a$ large.

Result: Variant A of brut-univ-sketching provides $\eta^*(\varepsilon)$ such that

$$S(p_{\eta^*(\varepsilon)}) \supseteq S(p), \quad \forall p \in \mathrm{Spa}(a, \varepsilon).$$

Conjecture: Variant B of brut-univ-sketching leads to the same output as Variant A.
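
A toy version of Variant B can be written for very small $n$, where the conditional entropies are computed exactly by enumeration instead of by simulation. This is only a sketch under several assumptions: the kernel $[[1, 1], [0, 1]]$ without bit reversal, a storage set defined as the indices with $H(U_i \mid U^{i-1}) \ge \delta$, and a search that stops at the uniform distribution rather than at $\eta(\varepsilon)$. All function names and parameter values are hypothetical.

```python
import itertools
import math
import numpy as np

def polar_matrix(n):
    """Kronecker power of [[1, 1], [0, 1]] (orientation assumed, no bit reversal)."""
    G = np.array([[1]], dtype=np.int64)
    while G.shape[0] < n:
        G = np.kron(G, np.array([[1, 1], [0, 1]], dtype=np.int64))
    return G

def conditional_entropies(p, n):
    """Exact H(U_i | U^{i-1}) in base a, for U = G_n X mod a, by enumeration."""
    a, G = len(p), polar_matrix(n)
    prefix = [dict() for _ in range(n + 1)]      # prefix[i]: pmf of (U_1, ..., U_i)
    for x in itertools.product(range(a), repeat=n):
        prob = math.prod(p[s] for s in x)
        if prob == 0.0:
            continue
        u = tuple(int(v) for v in (G @ np.array(x)) % a)
        for i in range(n + 1):
            prefix[i][u[:i]] = prefix[i].get(u[:i], 0.0) + prob
    H = [-sum(w * math.log(w, a) for w in d.values()) for d in prefix]
    return [H[i + 1] - H[i] for i in range(n)]

def storage_set(p, n, delta):
    """Indices whose conditional entropy is at least delta (assumed definition of S)."""
    return {i for i, h in enumerate(conditional_entropies(p, n)) if h >= delta}

def spike(eta, a):
    """Spike measure (1 - eta, eta/(a-1), ..., eta/(a-1))."""
    return [1.0 - eta] + [eta / (a - 1)] * (a - 1)

def variant_b(eps, a, n, delta, step=0.01):
    """Smallest eta on the grid with S(p_eta) containing S((1-eps, eps, 0, ..., 0))."""
    target = storage_set([1.0 - eps, eps] + [0.0] * (a - 2), n, delta)
    eta = eps
    while eta <= (a - 1) / a:          # stop at the uniform distribution
        if storage_set(spike(eta, a), n, delta) >= target:
            return eta
        eta += step
    return None

eta_star = variant_b(eps=0.05, a=3, n=4, delta=0.5)
```

The enumeration costs $a^n$ operations, so this only illustrates the search logic; at realistic block lengths the sets $S$ would have to come from simulations or bounds, as noted for step (ii) of Definition 16.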



In Figure 5, we show $\eta^*(\varepsilon)$ (obtained with Variant B of brut-univ-sketching) and the Shannon limit, which consists of the diagonal and is reached for $a = 2$. As observed, the improvement is significant with respect to $\eta(\varepsilon)$ (obtained with the $p_{cp}$ projection). Indeed, this brings the number of measurements very close to the optimal performance. Emre Telatar is gratefully acknowledged for his help in producing these plots.

For the decoding part, there are no guarantees that decoding with $p_{\eta^*}$ allows a correct recovery. One can use the algorithm polar-dec-adapt to learn the distribution, but one needs to first add checkers to the set of stored components. Checkers are components that need not be stored (because they have low conditional entropy) but that we still store to help the decoder get information about the source distribution. As long as the number of checkers is $o(n)$, the measurement rate is not affected.

2) Reconstruction: We now use polar-dec-adapt to decode the sensed components of the previous part. We first proceed to a patching of $\mathrm{Spa}(a, \varepsilon)$. Consider a uniform discretization of the convex hull of $(1 - \varepsilon, \varepsilon, 0, \ldots, 0), (1 - \varepsilon, 0, \varepsilon, 0, \ldots, 0), \ldots, (1 - \varepsilon, 0, \ldots, 0, \varepsilon)$. Enumerate this discretization as $D_d := \{p_1, \ldots, p_d\}$. Call $\mathrm{Spa}_d(a, \varepsilon)$ the set of distributions that dominate any of the elements in $D_d$ with respect to $\preceq_{cp}$. We then have

$$\mathrm{Spa}_d(a, \varepsilon) \xrightarrow{d \to \infty} \mathrm{Spa}(a, \varepsilon),$$

meaning that the set $\mathrm{Spa}_d(a, \varepsilon)$ becomes dense in $\mathrm{Spa}(a, \varepsilon)$. For a targeted $\varepsilon$, we then pick an $\varepsilon'$ slightly larger than $\varepsilon$ and a $d$ large enough such that

$$\mathrm{Spa}_d(a, \varepsilon') \supseteq \mathrm{Spa}(a, \varepsilon).$$

We then use polar-dec-adapt with the output of brut-univ-sketching and $o(n)$ checkers to learn which of the distributions $p_k$ is a good 'model' for the sensed data. The term model is used because, by construction of $\mathrm{Spa}_d(a, \varepsilon')$, the sensed data on the checkers must look typical with at least one of the $p_k$'s, although there might be more than one, and although none of these $p_k$'s may be the true distribution of $X^n$ (but they will be dominated with respect to $\preceq_{cp}$ by the true distribution, which is good enough to ensure correct decoding). One has to pick $\varepsilon' - \varepsilon$ small enough and $d$ large enough to ensure a small increase in the number of measurements (one needs to study the scaling of these parameters for a more precise statement). The overall complexity of this decoding scheme scales multiplicatively with $d$. Hence, as long as $d$ is not of the order of $n \log_2 n$, the overall complexity remains low.
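
The text does not fix a particular discretization, so the following sketch shows one hypothetical choice: a regular grid with resolution $N$ on the convex hull of the points $(1 - \varepsilon, \varepsilon e_j)$, which yields $d = \binom{N + a - 2}{a - 2}$ candidate models. The function name and parameters are illustrative only.

```python
from itertools import combinations

def hull_grid(a, eps, N):
    """Uniform grid on hull{(1 - eps, eps * e_j)}: points (1 - eps, eps * w / N),
    where w ranges over nonnegative integer vectors of length a - 1 summing to N."""
    points = []
    m = a - 1
    # stars and bars: choose m - 1 bar positions among N + m - 1 slots
    for bars in combinations(range(N + m - 1), m - 1):
        cuts = (-1,) + bars + (N + m - 1,)
        w = [cuts[i + 1] - cuts[i] - 1 for i in range(m)]
        points.append([1.0 - eps] + [eps * wi / N for wi in w])
    return points

grid = hull_grid(a=3, eps=0.2, N=4)   # 5 points on the segment between the two vertices
```

Since the decoding complexity scales multiplicatively with $d$, the resolution $N$ trades the extra measurements needed (through $\varepsilon' - \varepsilon$) against the number of candidate models to test.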

In a work in progress, we also consider another approach 
for universal decoding via an algebraic characterization of the 
possible likelihood ratios computed in polar-dec. 



VII. Discussion and extensions 

A. Universal polar coding 

We summarize here two ideas introduced in this paper to construct universal polar coding schemes:

1) Convolutional path ordering: to tell when a polar coding scheme designed for one distribution can succeed for another one.

2) Checkers: to learn some information about the distribution by storing components that did not need to be stored.

In particular, we developed an algorithm which makes it possible to compress binary sources universally at the lowest achievable rate, with low complexity and with guaranteed low error probability.

B. Sparse recovery and sketching 

We applied the tools developed for universal polar coding to the problem of sketching sparse signals, constructing a deterministic sketching matrix by deleting appropriate rows of the polar matrix $G_n$. We summarize here some conclusions and extensions of this approach.

1) A sketching method tuned to discrete signals.

Compressed sensing exploits the sparsity of signals in their domain to acquire them efficiently. If, for the application of interest, the signal is also sparse in its magnitude, that is, if it takes values in a set of small cardinality, this can also be exploited, as shown in this paper. For example, if the signal is binary, we developed a sketching method with a deterministic low-complexity matrix, an optimal number of measurements (for the scaling and the constant) and a low-complexity recovery algorithm with a proved exponentially small (in $\sqrt{n}$) probability of error. We extended these results to $a$-ary vectors, noticing a better fit for small $a$ and proposing an improved approach for larger $a$ (Section VI-E). We also underline that, for a given application, the method proposed here can be used adaptively by designing an appropriate probabilistic model for the signal. This can improve the measurement rate.

2) Lifting this work to the reals?

Most works in the CS literature constructing explicit sensing matrices are based on algebraic constructions [11], [10], [16]. In these works, matrices acting on the reals can then be obtained. Of course, we also made the point (previous item) that for certain applications, it may be more natural to work with the discrete setting directly. Yet, an interesting extension would be to study a lifting of our results to the real case. A possible approach would be via a quantization procedure, where problems of robustness to noise must be investigated. Another possible problem would be to attempt detecting the signal support only (which is a binary signal).